
Text Tower Refactor #185

Merged: 28 commits into main, Nov 4, 2022
Conversation

@rwightman (Collaborator)

A refactor of #178 that keeps backwards compatibility for existing weights/models but provides a different base model for new models using custom text towers...

@rom1504 (Collaborator) commented Oct 16, 2022

Would be nice to finish this sometime @iejMac @rwightman so we can have some multilingual CLIP in the future.

@rom1504 (Collaborator) commented Oct 16, 2022

Code looks good to me.

What do we want to validate here?

Maybe something like:

  • exact inference result before/after (a sketch of this check follows below)
  • can still load with torch.load
  • training works the same before/after

Anything else?
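
For the first item, a minimal sketch of the exact-inference check might look like this. Everything here is an assumption for illustration: the model name, the `laion400m_e32` pretrained tag, and the `reference_outputs.pt` file (assumed to have been written by this same script while on the main branch).

```python
import torch
import open_clip

# Minimal before/after inference check (sketch). The pretrained tag and the
# reference file path are placeholders; the reference tensors are assumed to
# have been saved by this same script on the main branch.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32")
model.eval()

torch.manual_seed(0)
image = torch.randn(1, 3, 224, 224)                 # fixed dummy image batch
text = open_clip.tokenize(["a diagram", "a dog"])   # fixed dummy captions

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

ref = torch.load("reference_outputs.pt")  # saved on main with the same inputs
assert torch.allclose(image_features, ref["image"], atol=1e-5)
assert torch.allclose(text_features, ref["text"], atol=1e-5)
```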

@rom1504 (Collaborator) commented Oct 16, 2022

Ideally we could validate that with unit tests, but we could also do one-shot testing.

@rwightman (Collaborator, Author)

@rom1504 yeah, I think those tests are reasonable; I already checked some pretrained zero-shot runs. Maybe an epoch of training on a smaller model we have some recent results for...?

@lopho (Contributor) commented Oct 17, 2022

Loading OpenAI weights now fails due to a change in model keys; specifically, the text. prefix is now gone from the text model. E.g. vit-l-14:

Missing key(s) in state_dict: "positional_embedding", "text_projection", "transformer.resblocks.0.ln_1.weight", ...
Unexpected key(s) in state_dict: "text.positional_embedding", "text.text_projection", "text.transformer.resblocks.0.attn.in_proj_weight", ...

This happens when loading via the factory functions open_clip.create_model_and_transforms and open_clip.create_model.

LAION weights still load fine, and inference on text and image matches the main branch. (EDIT) Tested with vit-l-14 laion2b_s32b_b82k, using random noise as image input and random strings of 1-5 words, each word 1-10 symbols.
I would recommend a proper integration test over a larger input space.
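
For anyone hitting this before the fix lands, a hypothetical workaround sketch is to strip the text. prefix from the converted state dict before loading manually. The checkpoint path below is a placeholder, and the real fix belongs in the conversion code itself, not in user code:

```python
import torch
import open_clip

# Hypothetical workaround (sketch): strip the "text." prefix left in by the
# broken OpenAI conversion so the keys line up with the classic CLIP module.
# "ViT-L-14-converted.pt" is a placeholder path for the converted checkpoint.
model = open_clip.create_model("ViT-L-14")

state_dict = torch.load("ViT-L-14-converted.pt", map_location="cpu")
remapped = {
    (k[len("text."):] if k.startswith("text.") else k): v
    for k, v in state_dict.items()
}
model.load_state_dict(remapped)
```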

@rom1504 (Collaborator) commented Oct 17, 2022

@lopho if you're interested in writing more automated tests, it would definitely be appreciated

@lopho (Contributor) commented Oct 18, 2022

I can write tests, but at the earliest toward the end of the week / weekend.
To be clear: tests that compare outputs across commits as part of CI to catch regressions, as well as tests for loading already published models. If I have the time, I could also take a look at tests for training.
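
A minimal sketch of what such a CI regression test could look like, in pytest. The reference-file layout, tolerances, and the assumption that a seeded random init is reproducible across commits are all illustrative, not something settled in this thread:

```python
import pytest
import torch
import open_clip

MODELS = ["ViT-B-32", "RN50"]  # extend to cover more configs

@pytest.mark.parametrize("model_name", MODELS)
def test_inference_regression(model_name):
    # Seed before model creation so the random init is reproducible,
    # then compare against outputs stored from a known-good commit.
    torch.manual_seed(0)
    model, _, _ = open_clip.create_model_and_transforms(model_name)
    model.eval()

    image = torch.randn(1, 3, 224, 224)
    text = open_clip.tokenize(["a photo of a cat"])

    with torch.no_grad():
        img_out = model.encode_image(image)
        txt_out = model.encode_text(text)

    # tests/refs/<model>.pt is assumed to be generated once by this same
    # code on a known-good commit and checked into the repo.
    ref = torch.load(f"tests/refs/{model_name}.pt")
    assert torch.allclose(img_out, ref["image"], atol=1e-4)
    assert torch.allclose(txt_out, ref["text"], atol=1e-4)
```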

…tom text model w/ existing models (for testing).
@rwightman (Collaborator, Author)

@rom1504 @lopho @iejMac

I fixed some issues and pulled in some other changes so that the testing only needs to be done once:

  • Loading OpenAI checkpoints was actually broken, as the conversion was left in; fixed.
  • I fixed pure float16 (and added pure bfloat16) support, including for the OpenAI weights path (torchscript and non-jit). This has come up a few times; it was broken in order to make AMP work for training.
  • Added a --force-custom-text flag to enable the custom text base model, to test the dual tower configs for eval and train (a usage sketch follows below).
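
A usage sketch for the new bits. Hedged: the precision strings follow the description above, but the pretrained tag is a placeholder and the CLI arguments surrounding --force-custom-text are assumptions:

```python
import open_clip

# Pure half-precision loading as described above (sketch; "laion400m_e32"
# is a placeholder pretrained tag).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32",
    pretrained="laion400m_e32",
    precision="fp16",   # pure float16; "bf16" for pure bfloat16
    device="cuda",
)

# The training-side flag could then be exercised roughly like:
#   python -m training.main --model ViT-B-32 --force-custom-text ...
# (the surrounding training arguments are omitted / assumed)
```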

@rom1504 (Collaborator) commented Nov 1, 2022

Do you think it's ready now, Ross?

@rom1504 (Collaborator) commented Nov 1, 2022

Code looks fine to me.

Ideally we would have unit tests (for inference and training), or at least check things once.

But I also think it's really safe at this point; we could probably just decide to merge to avoid blocking things further.

@lopho (Contributor) commented Nov 1, 2022

Yeah, sorry, I haven't found the time yet to write tests.
When I get around to it, I will try to manually test this branch pre-merge and post-merge.

@rwightman (Collaborator, Author)

@rom1504 I've tested quite a number of zero-shot eval models w/ different precisions, torchscript, custom tower enabled/disabled, and did a pure bf16 train locally on 1 GPU for 1 cc12m epoch (a rough sketch of the eval spot-check is below). Still need to test the ResNet and timm models to see if they still work (eval for ResNet, train an epoch for timm).

Should test training on a cluster w/ a proper dataset, maybe a few epochs of a 'B' model; not sure which cluster we should use / if we have anything in a good state :/ @JeniaJitsev ?
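
For reference, the kind of eager-vs-torchscript spot-check mentioned above might look like the sketch below. It assumes jit=True routes through the OpenAI torchscript path and that the two outputs should agree closely:

```python
import torch
import open_clip

# Sketch of an eager vs. torchscript spot-check on the same OpenAI weights.
torch.manual_seed(0)
image = torch.randn(1, 3, 224, 224)

eager = open_clip.create_model("ViT-B-32", pretrained="openai")
scripted = open_clip.create_model("ViT-B-32", pretrained="openai", jit=True)

with torch.no_grad():
    a = eager.encode_image(image)
    b = scripted.encode_image(image)

# Expect near-identical outputs; a large gap would flag a regression.
print((a.float() - b.float()).abs().max())
```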

@JeniaJitsev (Contributor)

@rwightman @rom1504 JUWELS Booster has so far handled runs with 32 nodes / 128 GPUs very robustly; no problems observed at all. So for testing models at 'B' scale, we are surely fine to go.

…ale. No bias on timm model projection by default. Fix null pretrained arg handling.
@rom1504 (Collaborator) commented Nov 3, 2022

I think it's not a good idea to keep pushing more unrelated changes here.
If we find after merging that this PR breaks things, it will be difficult to revert.

Let's not add anything more, and first validate and merge?

@rom1504 (Collaborator) commented Nov 3, 2022

Created #198 to track adding automated regression tests.
It can be done after this is merged, but it will make it much easier to merge PRs in the future.

@rwightman (Collaborator, Author)

> I think it's not a good idea to keep pushing more unrelated changes here. If we find after merging that this PR breaks things, it will be difficult to revert.
>
> Let's not add anything more, and first validate and merge?

I understand the sentiment, but with limited manpower and test coverage, and the demands of verifying training at scale, it's easy for a feature PR to end up as a release branch, since I want to ensure everything relevant to the near-term planned runs is tested.

Will keep new commits to bug fixes.

@rom1504 (Collaborator) commented Nov 3, 2022

OK, let's try to validate it ASAP and merge it so that things get easier.

[Review comment on src/training/main.py (outdated, resolved)]
@rwightman (Collaborator, Author)

I think this is ready to merge. @mitchellnw any concerns re: your JUWELS run? convnext is running but hasn't been stable; it seems like it might be stable with grad clipping enabled (which is a default for any supervised training of these models).

@mitchellnw (Contributor)

No concerns so far, but unfortunately I only set the time limit to 24h, so I am now waiting in the queue to resume.

[Screenshot: Screen Shot 2022-11-04 at 12 01 43 PM]

@rwightman (Collaborator, Author)

@mitchellnw without a special exception, 24h is the per-run time limit anyway...

@rwightman (Collaborator, Author)

OK, I'm going to merge this. There's definitely a possibility of some regressions, but it's best to get this structure into main and improve/fix from there. I will hold off on pushing a PyPI release until this has been on main for a bit with no showstopping wtfs popping up.

@rwightman merged commit 55174d7 into main on Nov 4, 2022
@mitchellnw (Contributor)

All good on the b/32 run; it reached 63.67%.
